
# Introduction 

_Data repository for:_

Brian Libgober, Connor T. Jerzak. Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records. Political Science Research and Methods: 1-20, 2024. [doi.org/10.1017/psrm.2024.55](https://doi.org/10.1017/psrm.2024.55)

```
@article{libgober2024linking,
  title={Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records},
  author={Libgober, Brian and Jerzak, Connor T.},
  journal={Political Science Research and Methods},
  year={2024},
  pages={1-20},
  publisher={Cambridge University Press}
}
```


# Details 

This repository contains large-scale training data for improving the linkage of datasets on organizations. `NegMatches_mat.csv` and `NegMatches_mat_hold.csv` contain millions of negative name-match examples derived from the LinkedIn network. `PosMatches_mat.csv` and `PosMatches_mat_hold.csv` contain millions of positive name-match examples from the same source (see the paper for details).
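
As a rough illustration of how the positive and negative match files could be combined into one labeled training stream, here is a minimal Python sketch. It assumes each CSV has a header row; the column layout is not documented here, so the rows are passed through unchanged. The `label` key and both function names are hypothetical and not part of the repository.

```python
import csv
from typing import Dict, Iterator


def stream_labeled(path: str, label: int) -> Iterator[Dict[str, object]]:
    """Stream the rows of one match file, tagging each with a 0/1 label."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: the first line of the file is a header row.
        for row in csv.DictReader(handle):
            tagged: Dict[str, object] = dict(row)
            tagged["label"] = label  # hypothetical key added for illustration
            yield tagged


def stream_training_pairs(
    pos_path: str = "PosMatches_mat.csv",
    neg_path: str = "NegMatches_mat.csv",
) -> Iterator[Dict[str, object]]:
    """Yield positive examples (label 1), then negative examples (label 0)."""
    yield from stream_labeled(pos_path, 1)
    yield from stream_labeled(neg_path, 0)
```

Because the files hold millions of rows, the sketch streams them one row at a time instead of loading everything into memory. Calling `stream_training_pairs()` from the repository root would iterate over the two main files.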

Additionally, files saved with `*_bipartite` in their names contain the bipartite network representation of the LinkedIn network that we use for improving linkage. Files saved with `*_markov` in their names contain the Markov network representation of the same network.

Finally, data from all examples used in the paper are available in the `Example*` folders. In each folder, the `x` and `y` datasets have linkage variables named `by_x` and `by_y`, respectively, and the merged `z` dataset contains both.
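
To make the roles of `by_x` and `by_y` concrete, the following Python sketch performs a plain exact-match inner join on those two variables. This is only a baseline for illustration, not the linkage method from the paper, and the function name `exact_join` is hypothetical. Rows are represented as dictionaries; how the example files are stored on disk is not assumed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, str]


def exact_join(x_rows: Iterable[Row], y_rows: Iterable[Row]) -> List[Row]:
    """Inner-join x and y where x['by_x'] equals y['by_y'] exactly."""
    # Index the y rows by their linkage variable for fast lookup.
    y_index = defaultdict(list)
    for y_row in y_rows:
        y_index[y_row["by_y"]].append(y_row)

    merged: List[Row] = []
    for x_row in x_rows:
        for y_row in y_index.get(x_row["by_x"], []):
            # Each merged row keeps both by_x and by_y.
            # Assumption: other column names do not collide; on a
            # collision the value from y overwrites the value from x.
            merged.append({**x_row, **y_row})
    return merged
```

Organization names that differ in spelling or abbreviation will not match under this exact join, which is the gap that the training data in this repository is meant to help close.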

# Questions 

If you have any questions, don't hesitate to reach out to `connor.jerzak@gmail.com`.
